Expression microarrays are commonly used to study transcriptomes. Most arrays are now based on oligonucleotide
probes. Because probe design is a tedious task, it often takes place once, at the beginning of a project, and the oligo set is
then used for several years. During this period, the knowledge gathered by the community on the genome and the
transcriptome grows and becomes more precise. Re-annotating the set is therefore essential to supply biologists with
up-to-date annotations. SigReannot-mart is a query environment populated with regularly updated annotations for different
oligo sets. It stores the results of the SigReannot pipeline, which has mainly been used on farm and aquaculture
species. It permits easy extraction in different formats using filters. It can be used to compare probe sets on different
criteria, to choose the set for a given experiment, or to mix probe sets in order to create a new one.

Database URL: http://sigreannot-mart.toulouse.inra.fr/ 



Project description 

Updating annotations 

The main step of the microarray creation process is the
probe design. The design tools aim at maximizing the
probe sequence specificity while covering the largest
possible set of transcripts or genes. The genome and
transcriptome sequences used as references during the design
evolve with each new assembly or new gene build, modifying
the link between the probes and the corresponding
biological entities. As nearly all the annotations related
to a probe are based on this link, interpreting the
annotation can become hazardous just a few months after the
design, especially for organisms with an unfinished genome
or a partially known transcriptome, like many farm animal
and aquaculture species.

Consequently, researchers need updated annotations
based on the best possible probe-transcript link. This
is the first goal of the SigReannot-mart database.



Re-assessing probe specificity 

A specificity indicator is produced by the pipeline (1) and
stored for each probe. This criterion can be used to evaluate
the specificity of a probe and its risk of cross-hybridization
with multiple transcripts. It is based on the number and the
similarity scores of the blastn (2) alignments between a
probe and a transcript or a genomic location.

As in the OligoRAP (3) and IMAD (4) pipelines, probes are
assigned to different Target Specificity Classes (Table 1),
based on the number and type of hits (5).

Users can choose the level of probe specificity on which
to focus their analysis. The specificity indicator is
of high interest to biologists when they start interpreting
their lists of over- and under-expressed genes. For
unspecific probes, only a more thorough analysis
can determine which of the biological entities is
the source of the signal. This is why SigReannot-mart
provides complementary specificity subcategories that use hit
location and description (Table 1).

Table 1. Probe target specificity classes (TSCs) and subclasses

TSCs               Description
1                  One good hit and no noise
2                  One good hit with noise
3                  No hit, one noise >30bp
4                  No hit, one noise >20 and <30bp
5                  No hit, many noises
6                  No good hit, no noises
7                  Many good hits
7.1 (subclasses)   Many good hits but one entity
MH (subclasses)    Multiple hits on one chromosome
MC (subclasses)    Hits on multiple chromosomes

These quality indicators use a blastn similarity search against the
transcriptome of the studied species. As an illustration, users
can decide whether or not to reject the probes with many hits
(Category 7), using the complementary fine-grained subcategory
indicators (7.1, MH, MC). Ensembl provides probe set mapping but does
not provide TSCs, and stores only probes with one genomic hit and no
more than one mismatch.
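
The class definitions of Table 1 can be read as a small decision procedure. The following is only a minimal sketch of that reading, not the pipeline's actual code: it assumes that the alignments of a probe have already been sorted into "good hits" and "noise" upstream (the criteria for that split are not given here), and the function name and its two parameters are hypothetical. A single noise alignment of exactly 30 bp is not covered by Table 1; the sketch places it in class 4 by assumption.

```python
from typing import Sequence


def target_specificity_class(n_good_hits: int, noise_lengths: Sequence[int]) -> int:
    """Return the TSC (1 to 7) of Table 1 for one probe.

    n_good_hits   -- number of alignments counted as good hits (assumed input)
    noise_lengths -- lengths in bp of the alignments counted as noise (assumed input)
    """
    n_noise = len(noise_lengths)
    if n_good_hits > 1:
        return 7  # many good hits
    if n_good_hits == 1:
        return 1 if n_noise == 0 else 2  # one good hit, without or with noise
    # From here on the probe has no good hit.
    if n_noise == 0:
        return 6  # no good hit, no noise
    if n_noise > 1:
        return 5  # no hit, many noises
    # Exactly one noise alignment: class 3 above 30 bp, class 4 otherwise
    # (the 30 bp boundary case is an assumption of this sketch).
    return 3 if noise_lengths[0] > 30 else 4


# Example: a probe with one good hit and one 25 bp noise alignment.
print(target_specificity_class(1, [25]))
```

The subclasses (7.1, MH, MC) are left out because they depend on the hit locations and descriptions rather than on hit counts alone.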



Providing rich annotation using multiple data sources 

Microarray manufacturers usually provide annotation files
for their oligo sets, but they do not update these files with
each new version of the genome annotation produced by
Ensembl or the NCBI. These annotation files often contain
limited information such as gene name, genomic location
and some Gene Ontology (GO) terms. To help scientists
interpret microarray expression data, SigReannot-mart
integrates multiple complementary data sources covering
external references, orthologous genes and pathways.
External references and orthologous genes are of high
interest for the extraction of pathway information from sources
such as DAVID (6) or IPA (7).

Comparing array designs 

When a biologist needs to choose between different
available array designs, a common comparison criterion is the
transcriptome coverage of the probe set (8,9). It can be
evaluated by extracting the lists of pathways, GO terms,
transcripts and gene IDs related to each set. But such a
task tends to be tedious when the oligo sets are not
annotated using the same methods and data sources. By giving
access to shared, standardized microarray annotations,
SigReannot-mart addresses this need.
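
As an illustration of the coverage criterion, the sketch below compares two probe sets by the fraction of reference transcripts that each one targets. It assumes that the transcript IDs linked to each set have already been exported as plain lists; the function name and the toy identifiers are hypothetical and do not come from the database.

```python
def coverage(probe_set_transcripts, all_transcripts):
    """Fraction of the reference transcripts targeted by at least one probe."""
    reference = set(all_transcripts)
    if not reference:
        return 0.0
    return len(set(probe_set_transcripts) & reference) / len(reference)


# Toy data: four reference transcripts and two candidate probe sets.
reference = ["T1", "T2", "T3", "T4"]
set_a = ["T1", "T2", "T2"]        # duplicates: several probes per transcript
set_b = ["T1", "T3", "T4", "T9"]  # T9 is not in the reference and is ignored

for name, transcripts in (("set A", set_a), ("set B", set_b)):
    print(name, coverage(transcripts, reference))
```

The same function applies unchanged to gene IDs, GO terms or pathways, as long as both sets were annotated with the same method.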

Designing new oligo sets from existing ones 

Thanks to new printing techniques, it has been quite
common for a number of years to build custom gene
expression microarrays, even for a single scientist or a
research team. Because the design step is an arduous, time-consuming
task, the strategy to build these custom microarrays is often
to select probes from different existing platforms. This
strategy also makes it possible to verify the expression range of
each oligonucleotide in the available data sets.
SigReannot-mart can be used to simplify this task: the
probes coming from different sets, sharing the same
annotation process, can easily be selected, merged and
compared to all the predicted transcripts of a studied species
in order to generate the new set, with better transcriptome
or specific metabolic pathway coverage.

Evolution of the annotation between Ensembl and
RefSeq versions

The database contains the annotation of a probe set for
different versions of Ensembl and RefSeq, which have
proved to be both complementary and helpful for the
interpretation of microarray gene expression data (10).

Data content of SigReannot-mart 

Probes are central in the mart table structure. A probe can
be linked to different probe sets (Table 2). Using alignment
results, each probe is given an Ensembl-gene
link specificity mark, which can evolve with new genome
assemblies or annotations. Through the Ensembl API,
based on the gene link, we fetch the orthologous genes
for several species, always including human and mouse, as
well as GO and cross-reference gene identifiers.

Then, using orthologous HGNC identifiers and the KEGG
files, we extract the KEGG orthologs (KO) and pathways related
to the probe. GO diagrams or enrichment analyses can easily
be performed using the text output formats. The HTML
output format links the identifiers to the corresponding
web pages (Table 3) on the KEGG, Ensembl, HGNC or
AmiGO web sites.

Customized BioMart environment 

SigReannot-mart implements BioMart (11) version 0.7.
For users unfamiliar with the BioMart query interface
(Figure 1), pre-formatted annotation files can be
downloaded directly from the repository web page. For a quick
overview of each data set annotation update, a summary
and statistical report is also provided.

A new data extraction format, the data matrix type, has
been added to the BioMart Attribute page to allow the
analysis of the diversity of gene categories such as KEGG
pathways. This format generates a Boolean matrix indicating
the membership of the probes in pathways. This type
of matrix is commonly used in R/Bioconductor (12).
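
To make the data matrix format concrete, here is a minimal sketch of how a Boolean probe-versus-pathway membership matrix can be built from a probe-to-pathways mapping. It illustrates the idea only and is not the BioMart formatter itself; the function name, the probe names and the pathway labels are hypothetical.

```python
def membership_matrix(probe_to_pathways):
    """Build a Boolean matrix with one row per probe and one column per pathway."""
    probes = sorted(probe_to_pathways)
    pathways = sorted({pw for pws in probe_to_pathways.values() for pw in pws})
    matrix = [
        [pw in probe_to_pathways[probe] for pw in pathways]
        for probe in probes
    ]
    return probes, pathways, matrix


# Toy annotation: each probe is linked to zero or more pathways.
annotation = {
    "probe_1": {"pathway_A", "pathway_B"},
    "probe_2": {"pathway_B"},
    "probe_3": set(),
}

probes, pathways, matrix = membership_matrix(annotation)
print("\t".join(["probe"] + pathways))
for probe, row in zip(probes, matrix):
    print("\t".join([probe] + [str(int(cell)) for cell in row]))
```

Printed as tab-separated 0/1 values, such a matrix can be read directly into a data frame for enrichment or clustering analyses.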

Query examples 

To illustrate the functions of SigReannot-mart, we present
two case studies here.

Table 2. Summary of data currently available in the SigReannot-mart database

                                   Data set
Microarray  Species  Manufacturer  Ensembl 56  Ensembl 59 + RefSeq RNA  Ensembl 61 + RefSeq RNA
44 K        Bovine   Agilent       *           *                        *
24 K                 EADGENE       *
22 K                 INRA          *
44 K        Chicken  Agilent       *           *                        *
20 K                 EADGENE       *                                    *
44 K        Horse    Agilent       *           *                        *
GPL2881     Mouse    Agilent                                            *
GPL2877     Rat      Agilent                                            *
44 K        Pig      Agilent       *           *                        *
25 K                 EADGENE       *
17 K                 INRA          *
44 K        Rabbit   Agilent       *           *                        *
44 K        Salmon   Agilent       *
15 K        Sheep    Agilent       *           *                        *
37 K        Trout    Agilent       *           *
GPL884      Human    Agilent                                            *

The probe set annotations are updated with each Ensembl release, at least twice a year. The current probe sets are
not available in Ensembl.

Asterisks indicate the annotation versions available for each probe set.



Table 3. External databases referenced from SigReannot-mart

Data source    Genes  Transcripts  Pathways  GO terms  Gene symbols  Orthologs  URL                                  Entities description
Ensembl        *      *                                              *          www.ensembl.org                      Gene, ncRNA, mRNA, putative RNA and orthologous genes
RefSeq                *                                                         http://www.ncbi.nlm.nih.gov/RefSeq/  Transcript
Gene Ontology                                *                                  http://www.geneontology.org/         GO term
HGNC                                                   *                        http://www.genenames.org/            Gene symbol
KEGG                               *                                 *          http://www.genome.jp/kegg/           Enzyme, pathways and ortholog groups

Asterisks indicate, for each data source, the biological entities imported into SigReannot-mart to perform the annotation
process.



Case 1: probe specificity study 

A biologist wants to check whether a given probe set contains
probes for a list of genes to be monitored, and how
specific these probes are.
For this, three criteria are used:

• the probe set name

• the gene names

• the specificity (Table 1)



Step 1: select the Ensembl version in the database
drop-down list.

Step 2: select the probe set in the datasets drop-down list.

Step 3: filter the probes using the IDs of the genes of
interest.

Step 4: filter the probes of the genes of interest by
specificity categories 1 and 2.

Figure 1. Annotation pipeline, BioMart integration and SigReannot-mart query interface. The management of the probe
annotation processing pipeline and of the BioMart environment is centralized and automated to allow efficient BioMart
configuration for multiple data sets with limited human intervention. The BioMart database is created and populated directly at the end
of the annotation pipeline (A), then the BioMart configuration is automatically generated (B) using an XML file created from a
generic template (C) and probe set properties (D). The SigReannot-mart data set can be filtered by user queries from a web page
(E). Many attributes can be used as filters, such as probe specificity, gene hits, chromosome hit location or orthologs.

Table 4. Species related to the data sets present in SigReannot-mart

Species    Category
Cow        Farm
Chicken    Farm
Horse      Farm
Pig        Farm
Rabbit     Farm
Sheep      Farm
Salmon     Fishery
Trout      Fishery
Mouse      Model
Rat        Model
Human      Model

BioMart query summary for Case 1:

Database     Sigenae oligo annotation
Dataset      btaurus_agilent_44k (bos_taurus)
Filters      Probe: [ID-list specified], Category: 1, 2
Attributes   Probe name, gene name, specificity category

By analyzing the number of records returned, which represent the probes
for the given gene list, the biologist can decide whether
or not to use this probe set.
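
The same query can also be expressed in the XML query format that BioMart 0.7 servers accept. The sketch below only builds the XML document and sends nothing. The dataset name is taken from the query summary, but the internal filter and attribute names (`gene_name`, `specificity_category`, `probe_name`) and the gene names are assumptions made for illustration; the real names are those shown by the XML button of the query interface.

```python
import xml.etree.ElementTree as ET


def build_query(dataset, filters, attributes):
    """Build a BioMart XML query string for one dataset."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "1",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    for name, value in filters.items():
        # Several values for one filter are passed as a comma-separated list.
        ET.SubElement(ds, "Filter", {"name": name, "value": value})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    body = ET.tostring(query, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query>' + body


# Hypothetical internal names; check them against the interface before use.
xml_query = build_query(
    "btaurus_agilent_44k",
    {"gene_name": "GENE_A,GENE_B", "specificity_category": "1,2"},
    ["probe_name", "gene_name", "specificity_category"],
)
print(xml_query)
```

Keeping the query as a generated string makes it easy to rerun the same selection against each new annotation version of the probe set.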

Case 2: custom microarray design 

A biologist wants to create a new probe set using existing
probes and probes designed for Ensembl transcripts having
no probes in the existing sets.

Step 1: select all chicken transcripts using the Ensembl
BioMart.

Step 2: select all chicken transcripts referenced by probes.

BioMart query summary (Step 1):

Database     Ensembl Gene at ensembl.org
Data set     Gallus gallus genes (WASHUC2)
Filters      (none)
Attributes   Ensembl Transcript ID

BioMart query summary (Step 2):

Database     Sigenae oligo annotation at SigReannot-mart.toulouse.inra.fr
Datasets     ggallus_agilent_44k (sus_scrofa) and ggallus_eadgene_20k (sus_scrofa)
Filters      Ensembl transcript ID ([ID-list specified])
Attributes   Ensembl Transcript ID

All Ensembl transcripts found in the first query and not in
the second one are not related to any probe and represent
valuable target sequences for designing new probes for a
custom probe set.
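
The comparison of the two query results is a plain set difference. A minimal sketch, assuming that each query result has been saved as a list of Ensembl transcript IDs (the function name and the toy identifiers below are hypothetical):

```python
def transcripts_without_probe(all_transcripts, probed_transcripts):
    """Return the transcripts of the first query that are absent from the second."""
    return sorted(set(all_transcripts) - set(probed_transcripts))


# Toy identifiers standing in for the results of the two queries.
ensembl_ids = ["TR_0001", "TR_0002", "TR_0003", "TR_0004"]
probed_ids = ["TR_0002", "TR_0004", "TR_0004"]

for transcript_id in transcripts_without_probe(ensembl_ids, probed_ids):
    print(transcript_id)
```

The resulting list is the set of target sequences for which new probes would have to be designed.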



Discussion and future directions 

While SigReannot-mart is mainly used for gene expression
microarray probe set quality re-assessment and
re-annotation (13), it can also be used to facilitate parts
of the probe design process.

Gene expression microarrays are still widely used, and
it is therefore important to continue the re-annotation
process. New probe designs are still coming out and probe
selection is ongoing: making these tasks easier
contributes to the effort of offering accurate tools to
monitor gene expression. Today, the SigReannot-mart data
processing is industrialized enough for us to consider a
user interface for adding a new probe set by
uploading a FASTA file and indicating the corresponding
species. Users would be able, a few hours later, to query
the resulting annotations. Alternatively, the FASTA files of
public probe sets received by email can already be
processed, even if the related animal species are not currently
supported (Table 4). Another, even simpler option would be
to schedule the annotation updates for existing probe sets.
These two features should be made available in the near
future.